Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

RNA-Seq Data Analysis ◾ 181

Once you have R and the edgeR and limma packages installed, you are ready for the differential analysis, which can be broken down into the following steps.

5.3.7.1 Data Preparation

For differential analysis, edgeR requires the count data file and a sample info file. We created the count data file in the previous step, but we still need to create the sample info file, which describes the design of the study. We can create the sample info file manually as shown in Table 5.1. The sample info file is tab-delimited, and its first column contains the unique sample IDs or the BAM file names. Additional columns can contain the conditions or factors, depending on the study design. For our example data, we can create the sample info file by executing the following bash script from the main project directory:

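cd bam
# List the BAM files and strip the trailing ".bam" (last four characters)
ls *.bam \
  | rev \
  | cut -c 5- \
  | rev > tmp.txt
# Write the header line of the sample info file
echo -e "sampleid\tcondition\tpatient" \
  > ../features/sampleinfo.txt
# Split each name on "_" and append: sample ID, condition, patient
awk -F '_' '{print $1 "_" $2 "\t" $1 "\t" $2}' \
  tmp.txt >> ../features/sampleinfo.txt
rm tmp.txt
cd ../features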

This script creates the sample info file from the BAM file names and saves it in the “features” subdirectory together with the read count data. For your own data, you may need to modify this script, or you can create the file with other Linux bash commands or manually.
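As one possible starting point for adapting the script to your own data, the sketch below wraps the same idea in a function that takes the BAM directory and the output file as arguments and avoids the temporary file. The function name make_sampleinfo and the demo file names are hypothetical, and the sketch assumes BAM names of the form condition_patient.bam, as the awk command above implies.

```shell
# Hypothetical helper (not from the book): build a tab-delimited sample info
# file from BAM names assumed to look like <condition>_<patient>.bam
make_sampleinfo() {
    bam_dir=$1
    out_file=$2
    # Header line
    printf 'sampleid\tcondition\tpatient\n' > "$out_file"
    for f in "$bam_dir"/*.bam; do
        [ -e "$f" ] || continue          # skip if no BAM files match
        basename "$f" .bam               # file name without path and ".bam"
    done | awk -F '_' -v OFS='\t' '{print $1 "_" $2, $1, $2}' >> "$out_file"
}

# Demo on empty placeholder files with made-up names
demo_dir=$(mktemp -d)
touch "$demo_dir/normal_p1.bam" "$demo_dir/tumor_p1.bam"
make_sampleinfo "$demo_dir" "$demo_dir/sampleinfo.txt"
cat "$demo_dir/sampleinfo.txt"
```

In the example project, the equivalent call from the main project directory would be make_sampleinfo bam features/sampleinfo.txt.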

Then open R, set the “features” directory as the working directory, and load both the limma and edgeR packages.

library(limma)

library(edgeR)

Load both the count data file and the sample info file into the R session as data frames.

seqdata <- read.delim("htcount2.txt", stringsAsFactors=FALSE)

sampleinfo <- read.delim("sampleinfo.txt", stringsAsFactors=FALSE)
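Before relying on the loaded data, it can help to confirm from the shell that the two files agree with each other. The following is a minimal sketch, not part of the book's workflow: the function name check_sample_counts is hypothetical, and it assumes that each file has one header line and that the count file has two leading annotation columns (gene symbol and transcript ID), as in this example.

```shell
# Hypothetical sanity check (not from the book): compare the number of count
# columns with the number of samples, and check that sample IDs are unique.
# Assumes one header line per file and two leading annotation columns.
check_sample_counts() {
    count_file=$1
    info_file=$2
    # Count columns = total header fields minus the two annotation columns
    n_count_cols=$(head -n 1 "$count_file" | awk -F '\t' '{print NF - 2}')
    # Non-empty data rows in the sample info file
    n_samples=$(tail -n +2 "$info_file" | grep -c .)
    # Distinct values in the first (sample ID) column
    n_unique=$(tail -n +2 "$info_file" | cut -f 1 | grep . | sort -u | wc -l | tr -d ' ')
    if [ "$n_count_cols" -eq "$n_samples" ] && [ "$n_samples" -eq "$n_unique" ]; then
        echo "OK: $n_samples samples"
    else
        echo "MISMATCH: $n_count_cols count columns, $n_samples samples, $n_unique unique IDs"
        return 1
    fi
}

# Demo on two tiny made-up files
check_dir=$(mktemp -d)
printf 'symbol\ttranscript\ts1\ts2\nA\tT1\t5\t7\n' > "$check_dir/counts.txt"
printf 'sampleid\tcondition\tpatient\ns1\tnormal\tp1\ns2\ttumor\tp1\n' > "$check_dir/info.txt"
check_sample_counts "$check_dir/counts.txt" "$check_dir/info.txt"
```

For the example data, the call from the “features” directory would be check_sample_counts htcount2.txt sampleinfo.txt.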

Run the following command to display the first rows of the count data frame:

head(seqdata)

You will notice that the first two columns are the gene symbol and the transcript IDs. The

other six columns contain the read counts. In the next step, we need to separate the count